This is a demo for the Canopy2 package in R. Canopy2 is a statistical and computational method for tumor phylogeny inference using single-nucleotide variants derived from bulk DNA and single-cell RNA sequencing. Canopy2 samples from a joint probability distribution involving a mixture of a binomial and beta-binomials, specifically chosen to account for the sparsity and stochasticity of the single-cell data.
To use this package, the following inputs are required:
Rb and Xb, alternative (mutated loci) and total (mutated + benign loci) read counts from bulk DNA
whole exome sequencing (WES) data. Both have dimension \(M\) (mutations) \(\times\) \(S\) (bulk samples).
A sample pipeline that can be used to preprocess these data is located in our Zenodo repository at https://zenodo.org/record/7931384.
Rs and Xs, alternative (mutated loci) and total (mutated + benign loci) read counts from single-cell
RNA sequencing (scRNA-seq) data. Both have dimension \(M\) (mutations) \(\times\) \(N\) (single-cells). A sample pipeline that can be used to preprocess these data is located in our Zenodo repository at https://zenodo.org/record/7931384.
alpha and beta, estimates of the gene activation rates and gene deactivation rates using gene expression data from scRNA-seq. These are mutation-specific and of dimension \(1 \times M\). One can use function get_burstiness_bpsc() or get_burstiness_scale() to estimate alpha and beta.
kappa and tau, hyperparameters used to estimate the sequencing error. To date (year 2023), the average error rate of next-generation sequencing is reported to be 0.1% per nucleotide (Broeckx et al. (2017)), so we set the hyperparameters to default values of kappa=1 and tau=999 so that the sequencing error rate,
\(\epsilon\) (assumed to follow a beta distribution) has mean \(\kappa/(\kappa +\tau) = 0.001\).
Of note: the notation for the number of bulk samples differs between the text and code. In the text, the number of bulk samples is denoted by \(T\), while in the code it is denoted by \(S\) (since \(T\) stands for TRUE in R and can cause conflict).
We will first demonstrate use of Canopy2 in a simulation setting. Then, we will apply Canopy2 to a case study from a patient with glioblastoma (GBM), a cancer of the brain.
Canopy2 is available on CRAN (https://cran.r-project.org/web/packages/canopy2/index.html):
install.packages('canopy2')
A developmental version can be installed from GitHub (https://github.com/annweideman/canopy2):
if (!require("devtools")) install.packages("devtools")
library(devtools)
install_github("annweideman/canopy2") # STRONGLY recommended
Let us assume we have been provided with two bulk DNA WES samples (S=2) and a small number of single-cells (N=30) and point mutations (M=5). Additionally, let the true number of subclones (tips of the phlogenetic tree including the normal clone) be equal to four (ktrue=4).
The mutation-specific gene activation rate, \(\alpha_m\), and deactivation rate, \(\beta_m\), are chosen to be close those observed in vitro, i.e., \(\alpha_m << \beta_m\). The average error rate of next-generation sequencing is reported to be 0.1% per nucleotide (Broeckx et al. (2017)), so we set the sequencing error hyperparameters to default values of kappa=1 and tau=999 so that the sequencing error rate, \(\epsilon\) (assumed to follow a beta distribution) has mean \(\kappa/(\kappa +\tau) = 0.001\)
We utilize a bulk sequencing depth of 30x (b.mindepth=30) to 50x (b.maxdepth=50) and a single-cell sequencing depth of 80x (sc.mindepth=80) to 120x (sc.maxdepth=120). Lower depths, such as these, are more standard in practice as most laboratories have limited budgets. We expect inaccuracies in variant discovery when utilizing lower coverage and suggest that the user experiment with a wide range of sequencing depths when testing these simulations.
library(canopy2)
# Set parameter values for simulation from beta-binomial
N=30; S=2; M=5; Ktrue=4 #number of single cells, bulk samples, mutations, clones
alpha=0.1; beta=1 #gene activation and deactivation rates, treated as identical across all genes for simulation purposes
scale=300 #scaling factor for beta distribution: rate at which DNA transcribed into RNA
kappa=1; tau=999 #parameters for sequencing error
b.mindepth=30; b.maxdepth=50 #bulk min and max sequencing depth
sc.mindepth=80; sc.maxdepth=120 #single-cell min and max sequencing depth
scale=300 #scaling factor for beta distribution: rate at which DNA transcribed into RNA
The MCMC will be run for 5,000 iterations (niter=5000) across 5 chains (nchains=5), although in practice you will want to increase the number of iterations to at least 50k and the number of chains to at least 10. To ensure that the values of the joint posterior are not highly correlated and are evaluated after convergence has occurred, only every 20th value is stored (thin=20) and 20% burn-in (pburn=0.2) is removed. A range of possible subclones between 3 and 5 (subclones=3:5) are evaluated, although we know the truth is 4 subclones (ktrue=4).
To simulate data from a beta-binomial distribution, the mutation-specific gene activation rates, \(\alpha_m\), and gene deactivation rates, \(\beta_m\), will need to first be estimated from simulated gene expression data using the param.est function. The gene expression data is simulated from a beta-Poisson distribution in the simulate_data() function; the methodology behind this is discussed further in the text.
seed=8675309 # seed to start random number generation
# Simulate data from a beta-binomial distn
sims.out<-simulate_data(N=N, S=S, M=M, alpha=alpha, beta=beta, kappa=kappa,
tau=tau, Ktrue=Ktrue, b.mindepth=b.mindepth,
b.maxdepth=b.maxdepth, sc.mindepth=sc.mindepth,
sc.maxdepth=sc.maxdepth, scale=scale, seed=seed)
After simulating read counts using the simulate_data() function, the vectors of mutation-specific hyperparameters, alpha and beta, which represent the gene activation and deactivation rates, can be estimated empirically using the gene in which each point mutation resides. The Canopy2 package provides two functions, get_burstiness_bpsc() and get_burstiness_scale(), to estimate these hyperparameters. The BPSC routine depends on MCMC routines, whereas, the SCALE routine utilizes moment estimators. The BPSC methodology is appropriate for gene expression datasets of small to moderate size and was shown to outperform the SCALE methodology in estimation (see manuscript Supplemental Material). For large datasets, we recommend trying the BPSC methodology first, and, if too computationally intensive, we suggest employing the SCALE methodology instead.
# Estimate parameters for gene kinetics using the BPSC methodology
# Note: can also use get_burstiness_scale() for large datasets
param.out<-get_burstiness_bpsc(counts=sims.out$G)
#gene activation rate
param.out$alpha
#> [1] 0.06570306 0.05952492 0.08785977 0.06181939 0.05914870
#gene deactivation rate
param.out$beta
#> [1] 1.2852380 0.7088066 2.1700027 5.4015547 1.9148217
While not relevant for this simulation, in the real data there will generally be some mutations for which bursting kinetics cannot be estimated due to an abundance of zero read counts. Thus, after \(\alpha\) and \(\beta\) are estimated, the alternative and total read counts for both the bulk and single cell data must be subsetted to include only those mutations with estimable \(\alpha_m\) and \(\beta_m\), where subscript \(m\) denotes the mutation. For demonstration, we will include this chunk of code, although the number of rows in the read count matrices (\(R^S\), \(R^B\), \(X^S\), \(X^B\)) will remain unchanged.
# Row numbers corresponding to estimable gene kinetics
param.out$id.g
#> [1] 1 2 3 4 5
# Subset the read counts to include only those mutations with estimable gene
# kinetics
Rs<-sims.out$Rs[param.out$id.g,] # single cell alternative read counts
Xs<-sims.out$Xs[param.out$id.g,] # single cell total read counts
Rb<-sims.out$Rb[param.out$id.g,] # bulk alternative read counts
Xb<-sims.out$Xb[param.out$id.g,] # bulk total read counts
The estimates for \(\alpha\) and \(\beta\) are passed to get_trees(), which outputs trees, posteriors, and Metropolis-Hastings acceptance rate associated with \(K\) subclones. The output from get_trees() will consist of a list of lists of length nchains * K where each list corresponds to a specific chain and subclone.
# Specify arguments for MCMC
nchains<-5 # number of chains
pburn<-0.2 # percent burn-in to remove
thin<-20 # save every 20th iteration (discard 1-19)
niter<-5000 # number of iterations of MCMC
Klist=3:5 # range of subclones to try
ncores=10 #number of cores
# Produce trees, posteriors and Metropolis-Hastings acceptance rate associated
# with K subclones
sim.trees<-get_trees(Rs=Rs, Rb=Rb, Xs=Xs, Xb=Xb,
alpha=param.out$alpha, beta=param.out$beta,
kappa=kappa, tau=tau, Klist=Klist, niter=niter,
nchains=nchains, thin=thin, pburn=pburn,
ncores=ncores, seed=seed)
#> [1] "Executing code in parallel using 10 cores."
# For K=3 and the first chain
sim.trees$samples[[1]]$K
#> [1] 3
sim.trees$samples[[1]]$tree[[1]] #print only first tree from first iteration
#>
#> Phylogenetic tree with 3 tips and 2 internal nodes.
#>
#> Tip labels:
#> 1, 2, 3
#>
#> Rooted; no branch lengths.
tail(sim.trees$samples[[1]]$posteriors)
#> [1] -178.0767 -184.4751 -182.9802 -176.7022 -179.5247 -178.8971
sim.trees$samples[[1]]$`acceptance rate`
#> [1] 0.4280247
The trees and posteriors obtained from get_trees() can be used to produce diagnostic plots to investigate the lag autocorrelation factor (lag ACF), posterior densities, and trace plots. Recall, we want the lag ACF, or the correlation between adjacent posterior values, to be low. Additionally, the posterior densities should align in space, and the trace plots should appear as random noise.
get_diagnostics(get.trees.out=sim.trees)
Finally, get_best_tree() is used to produce the optimal configuration of the phylogenetic tree by using a modification of the Bayesian information criterion (BIC) as defined in the documentation.
The tree corresponding to the minimum BIC is that with 4 subclones which coincides with the true number of subclones.
sim.best.tree<-get_best_tree(get.trees.out=sim.trees)
#> [1] "k=3; mean of maximized log-posteriors across all chains=-164.31; BIC=371.78"
#> [1] "k=4; mean of maximized log-posteriors across all chains=-164.31; BIC=354.19"
#> [1] "k=5; mean of maximized log-posteriors across all chains=-164.31; BIC=357.47"
sim.best.tree
#> $K
#> [1] 4
#>
#> $tree
#>
#> Phylogenetic tree with 4 tips and 3 internal nodes.
#>
#> Tip labels:
#> 1, 2, 3, 4
#>
#> Rooted; no branch lengths.
#>
#> $branches.muts
#> [,1]
#> [1,] "M1: snv3"
#> [2,] "M2: snv4"
#> [3,] "M3: snv2, snv5"
#> [4,] "M4: snv1"
#>
#> $clonal.muts
#> [,1]
#> Clone:1 "None"
#> Clone:2 "snv3"
#> Clone:3 "snv2, snv4, snv5"
#> Clone:4 "snv1, snv4"
#>
#> $posteriors
#> [1] -161.7297 -162.5213 -164.0343 -161.5074 -160.3348 -162.1756 -160.3998
#> [8] -161.3883 -162.8927 -164.9583 -162.6257 -159.7299 -161.8138 -160.2135
#> [15] -164.5430 -162.1760 -161.8490 -163.4764 -161.0123 -159.3217 -160.9657
#> [22] -160.1034 -161.9612 -165.1032 -176.3758 -175.6727 -160.8985 -163.8514
#> [29] -157.7094 -160.1265 -159.8953 -163.2459 -166.2495 -163.0461 -165.0351
#> [36] -168.8975 -167.4663 -164.0833 -161.4062 -163.8341 -164.6413 -162.1618
#> [43] -162.3025 -160.2973 -161.8063 -160.4476 -162.5734 -161.4314 -161.9688
#> [50] -162.3305 -165.1859 -166.8790 -159.7710 -159.3259 -161.6834 -158.8527
#> [57] -167.2227 -168.6026 -163.7040 -158.8220 -163.3746 -161.5546 -162.7620
#> [64] -168.0713 -162.6641 -161.0400 -161.2132 -162.5729 -165.6891 -167.2685
#> [71] -163.2221 -167.4647 -160.9132 -165.6676 -164.5982 -162.5236 -162.9440
#> [78] -165.0077 -164.6637 -165.5166 -161.6400 -164.2965 -173.0735 -170.3296
#> [85] -164.3697 -163.9055 -158.3979 -163.9402 -162.0451 -159.9783 -165.6694
#> [92] -165.3186 -168.9189 -159.6287 -162.6181 -163.8039 -158.3842 -162.6906
#> [99] -160.9763 -159.8244 -162.8366 -162.5963 -163.4658 -167.3131 -164.6421
#> [106] -164.6833 -161.7303 -164.8150 -160.0667 -166.5534 -163.9204 -164.7804
#> [113] -167.9746 -166.6208 -163.6196 -162.7333 -162.0322 -160.7680 -161.8001
#> [120] -165.8896 -163.1554 -159.5067 -158.7226 -162.8391 -163.8914 -161.2408
#> [127] -166.9065 -163.3945 -164.9210 -164.6281 -164.5335 -163.9288 -159.7256
#> [134] -161.8899 -167.9784 -164.2789 -164.8609 -166.1609 -165.0404 -161.3281
#> [141] -166.0640 -161.2202 -168.3708 -161.9194 -159.9884 -164.3115 -161.2833
#> [148] -163.2938 -162.3712 -163.1992 -161.8762 -161.2224 -161.4824 -161.6941
#> [155] -164.2133 -166.8482 -170.6795 -163.2973 -163.3445 -161.2491 -162.1885
#> [162] -167.3142 -165.9562 -166.9267 -165.5980 -161.7118 -162.7740 -164.6417
#> [169] -160.8438 -158.8578 -160.3191 -162.8446 -160.6739 -169.3928 -165.9248
#> [176] -159.4164 -164.3555 -163.6986 -162.5244 -172.1248 -163.2072 -160.9204
#> [183] -160.4847 -165.9380 -164.4353 -160.1979 -161.8194 -160.6288 -165.1915
#> [190] -165.0666 -161.0911 -161.3034 -163.0492 -161.1111 -159.3026 -157.9837
#> [197] -160.1741 -164.4247 -163.6295 -161.0259
#>
#> $BIC
#> BIC
#> 354.1922
#>
#> $acceptance.rate
#> [1] 0.4864151
We can quantify the absolute reconstruction error associated with estimating the three components of the tree, \(Z\), \(P^s\), and \(P^b\) by finding the minimum absolute distance between the inference and all permutations of the truth.
ez<-error_z(true.Z=sims.out$true.tree$Z, inferred.Z=sim.best.tree$tree$Z)
eps<-error_ps(true.Ps=sims.out$true.tree$Ps, inferred.Ps=sim.best.tree$tree$Ps)
epb<-error_pb(true.Pb=sims.out$true.tree$Pb, inferred.Pb=sim.best.tree$tree$Pb)
Thus, there is 5% error in estimating the clonal configuration matrix, \(Z\), 23% error in estimating the single-cell to clone assignment matrix, \(P^s\), and 15% error in estimating the bulk sample to clone assignment matrix, \(P^b\), which is reasonable given the small number of MCMC iterations and chains.
In this example, we apply Canopy2 to data from a patients with glioblastoma, a cancer of the brain. The package contains sample datasets from three patients: GBM2, GBM9, and GBM10. GBM10 is the smallest of these three and thus used for demonstration purposes. We also make data available from two patients with breast cancer (BC), BC03 and BC07.
All datasets are saved in the pre- and post-processed forms. For example, GBM10 is available in pre-processed form as GBM10_preproc and post-processed form as GBM10_postproc. For this particular example, we will use the post-processed data, and we refer the user to the subsequent section Data processing where we discuss the differences between the pre- and post-processed forms and how to conduct the post-processing.
First, we load the post-processed data for patient GBM10
data("GBM10_postproc")
We then run Canopy2 to obtain a list of phylogenetic trees corresponding to all chains and all subclones. We experiment with 4-8 subclones (Klist=4:8) and a small number of MCMC iterations (niter=10000) for demonstration purposes. We suggest increasing the number of iterations substantially (e.g., 50k or 100k) when conducting your own real data analysis and running on a computing cluster.
get.trees.out<-get_trees(Rs=GBM10_postproc@Rs, Rb=GBM10_postproc@Rb,
Xs=GBM10_postproc@Xs, Xb=GBM10_postproc@Xb,
alpha=GBM10_postproc@param.est$alpha,
beta=GBM10_postproc@param.est$beta, kappa=kappa,
tau=tau, Klist=4:8, niter=10000, nchains=10,
thin=thin, pburn=pburn, seed=8675309)
#> [1] "Executing code in parallel using 10 cores."
A quick look at the diagnostic plots for reveals that the posteriors are well-aligned in space for K=6, the lag ACF is low (below the 0.3 threshold) at lag 5 or 10 (except for chain 6), and the trace plots appear as random noise. Overall, we don’t observe anything of concern in the diagnostic plots.
get_diagnostics(get.trees.out)
Finally, we want to determine the optimal number of subclones across the range ofK=4:8. It appears from the plot that BIC is minimized for K=6, so we select this as the final configuration.
best.tree.out<-get_best_tree(get.trees.out)
#> [1] "k=4; mean of maximized log-posteriors across all chains=-412.15; BIC=933.18"
#> [1] "k=5; mean of maximized log-posteriors across all chains=-412.15; BIC=891.67"
#> [1] "k=6; mean of maximized log-posteriors across all chains=-412.15; BIC=868.99"
#> [1] "k=7; mean of maximized log-posteriors across all chains=-412.15; BIC=873.28"
#> [1] "k=8; mean of maximized log-posteriors across all chains=-412.15; BIC=880.91"
#> Warning: Use of `temp.df$x1` is discouraged.
#> ℹ Use `x1` instead.
#> Warning: Use of `temp.df$x2` is discouraged.
#> ℹ Use `x2` instead.
#> Warning: Use of `temp.df$x3` is discouraged.
#> ℹ Use `x3` instead.